NLP - Disaster Tweets

Description

This Data Challenge is inspired by a Kaggle prediction competition – Natural Language Processing with Disaster Tweets. The goal of this challenge is to build various machine learning models that predict which tweets are about real disasters and which are not, labelling disaster tweets as 1 and non-disaster tweets as 0. Natural Language Processing techniques will be applied throughout these comparison experiments.

The train dataset consists of 7,613 hand-classified tweets. Both the train and test CSV files contain the columns “id” – a unique identifier for each tweet, “keyword” – a particular keyword from the tweet (may be blank), “location” – the location the tweet was sent from (may be blank), and “text” – the text of the tweet. In addition, the train dataset has a “target” column that designates whether a tweet is about a real disaster (1) or not (0).

1. Load data

1.1 Install packages

Download the glove.6B.100d.txt file (pre-trained GloVe word embeddings)
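Once downloaded, the embeddings file can be parsed into a word-to-vector lookup table. A minimal sketch, assuming the standard GloVe text format (one word per line followed by its space-separated vector components); `load_glove` is an illustrative helper name, not part of the notebook:

```python
def load_glove(path):
    """Parse a GloVe-format text file into a dict mapping word -> vector.

    Each line looks like: "word 0.1 0.2 ... 0.9".
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            # First token is the word, the rest are the vector components.
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

With glove.6B.100d.txt every vector has 100 components; the resulting dict can later back an embedding matrix for a neural model.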

1.2 Imports

2. Data Preparation

2.1 Extract URLs and numbers in separate columns
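A sketch of how such extraction might look with regular expressions; the helper names and exact patterns are illustrative assumptions, not the notebook's code:

```python
import re

URL_RE = re.compile(r"https?://\S+")   # http/https links, e.g. t.co short URLs
NUM_RE = re.compile(r"\b\d+\b")        # standalone numbers (counts, magnitudes)

def extract_urls(text):
    # Return all http/https URLs found in the tweet text.
    return URL_RE.findall(text)

def extract_numbers(text):
    # Return standalone numbers; digits embedded inside words are ignored.
    return NUM_RE.findall(text)
```

Assuming a pandas DataFrame named `train`, these could be applied column-wise, e.g. `train["urls"] = train["text"].apply(extract_urls)`.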

2.2 Keep http in text

2.3 Prepare keywords column
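Multi-word keywords in this dataset are URL-encoded (e.g. "airplane%20accident"), so preparation typically decodes and normalizes them. A minimal sketch; `clean_keyword` and the "no_keyword" placeholder are illustrative choices:

```python
from urllib.parse import unquote

def clean_keyword(kw):
    # Keywords may be missing; fill with a placeholder so the column
    # can still be used as a categorical feature (illustrative choice).
    if kw is None:
        return "no_keyword"
    # Decode URL escapes (%20 -> space) and normalize case.
    return unquote(kw).lower()
```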

3. EDA

3.1 Check out a disaster tweet

3.2 Check out a non-disaster tweet

3.3 Plot null values in the train & test datasets

3.4 Show the length of a tweet
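Tweet length can be summarized by character and word counts; a minimal illustrative helper (not the notebook's own code):

```python
def length_features(text):
    # Two simple numeric features: character count and word count.
    return {"char_len": len(text), "word_len": len(text.split())}
```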

3.5 Train dataset distribution by target

3.6 Target distribution by keyword

3.7 Word clouds of disaster and non-disaster tweets to see the most frequent words
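A word cloud is essentially a visualization of word frequencies; the underlying counts can be computed with a `Counter`. An illustrative sketch (the notebook itself would typically use the wordcloud package for the actual plot):

```python
from collections import Counter

def top_words(texts, n=10):
    # Count word frequencies across a collection of (already cleaned)
    # tweets; a word cloud scales each word by these counts.
    counts = Counter(word for text in texts for word in text.lower().split())
    return counts.most_common(n)
```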

3.8 Top 10 keywords in the train dataset

3.9 Visualize tweets by location on a map

4. Data Preprocessing

4.1 Clean the tweets

Make the text lowercase, remove text in square brackets, remove links, remove punctuation, and remove words containing numbers.
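The steps above can be sketched as a single cleaning function (illustrative, not the notebook's exact code; the order matters – links are stripped before punctuation so URLs do not leave fragments behind):

```python
import re
import string

def clean_text(text):
    text = text.lower()                                # lowercase
    text = re.sub(r"\[.*?\]", "", text)                # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # links
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # punctuation
    text = re.sub(r"\w*\d\w*", "", text)               # words containing numbers
    return re.sub(r"\s+", " ", text).strip()           # collapse extra whitespace
```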

4.2 Stopwords
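Stopword removal filters out high-frequency words that carry little signal. A minimal sketch with a small illustrative subset; the notebook would typically use the full NLTK list, `nltk.corpus.stopwords.words("english")`:

```python
# Small illustrative subset of English stopwords (not the full NLTK list).
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "of", "to", "in", "this"}

def remove_stopwords(words):
    # Keep only tokens that are not stopwords.
    return [w for w in words if w not in STOPWORDS]
```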

4.3 Tokenization
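Tokenization splits the cleaned text into word tokens. A simple regex-based sketch; `nltk.word_tokenize` is a common alternative, but once punctuation has been stripped a regex is usually enough:

```python
import re

def tokenize(text):
    # Extract lowercase alphabetic word tokens from the cleaned text.
    return re.findall(r"[a-z]+", text.lower())
```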

4.4 Lemmatization

4.5 Transform tokens back into sentences

4.6 Word cloud of the cleaned text

5. Models

5.1 BERT model

BERT model trained for 10 epochs

BERT on unlemmatized text

5.2 XGBoost model

With text column only

With 3 features

With all features

5.3 LSTM

With features